NaN problem of nn.MultiheadAttention in PyTorch


The nn.MultiheadAttention NaN problem

Over the past few days I have been tracking down the NaN that appears during the computation of PyTorch's nn.MultiheadAttention (see "nn.MultiheadAttention causes gradients to become NaN under some use cases", Issue #41508, pytorch/pytorch on GitHub). The root cause is that the tokenizer adds padding tokens on the left (they can only be added on the left; adding them on the right is wrong, because an autoregressive LLM cannot continue generating after padding tokens), so once the causal mask and the padding mask are merged, the first few rows of the attention matrix end up masked in their entirety. PyTorch fills masked positions with float("-inf"), so after softmax those rows become all NaN.
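A minimal repro of the failure mode (shapes, seed, and sizes are arbitrary, chosen only for illustration): with left padding plus a causal mask, the leading query rows can only attend to padding keys, every entry in those rows becomes float("-inf"), and softmax turns the whole row into NaN.

```python
import torch
import torch.nn as nn

# softmax over a row that is entirely -inf is NaN
print(torch.softmax(torch.full((4,), float("-inf")), dim=-1))  # tensor([nan, nan, nan, nan])

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 4, 8)                                                   # one left-padded sequence of length 4
causal_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)   # True = not allowed to attend
key_padding_mask = torch.tensor([[True, True, False, False]])              # first 2 positions are padding

out, _ = mha(x, x, x, attn_mask=causal_mask, key_padding_mask=key_padding_mask)
print(out)  # on affected PyTorch versions the first two query positions come out as NaN (fully masked rows)
```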

Solutions

  1. Have the tokenizer add the padding tokens on the right instead: not feasible, for the reason above.
  2. PyTorch supports two kinds of mask, float masks and bool masks: bool masks are automatically filled with float("-inf"), while float masks are left untouched. So you can build the mask yourself and fill the positions that need masking with a very small finite value such as -1e8 (see the sketch after this list).
  3. Leave the padding tokens alone in the forward pass and only ignore them when computing the loss, i.e. loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id).
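A rough sketch of option 2 (dimensions and names are illustrative only): merge the causal and padding masks yourself into a single float mask, filling masked positions with -1e8 instead of float("-inf"), so that even a fully masked row still yields finite softmax outputs.

```python
import torch
import torch.nn as nn

seq_len, pad_len = 4, 2
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
padding = torch.zeros(seq_len, dtype=torch.bool)
padding[:pad_len] = True                       # left padding: first pad_len positions are pad tokens

# merge causal + padding into one float mask; masked positions get -1e8, not -inf
combined = causal | padding.unsqueeze(0)       # (seq_len, seq_len), True = masked
float_mask = torch.zeros(seq_len, seq_len).masked_fill(combined, -1e8)

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, seq_len, 8)
out, _ = mha(x, x, x, attn_mask=float_mask)    # fully masked rows now give a uniform softmax, not NaN
print(out.isnan().any())                       # tensor(False)
```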

Other observations

  1. Qwen's tokenizer has tokenizer.padding_side="right", so when doing batched completion the shorter sentences, i.e. the ones that get padding tokens appended, produce very poor completions, as shown below.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_pth = "qwen/Qwen2-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_pth, torch_dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_pth)

# inference
in_text = ["Beijing is the captical of ", "Apple"]
in_tokens = tokenizer(in_text, return_tensors="pt", padding=True)["input_ids"]
print("bos:", tokenizer.bos_token_id, tokenizer.bos_token)  # bos: None None
print("eos:", tokenizer.eos_token_id, tokenizer.eos_token)  # eos: 151645 <|im_end|>
print("pad:", tokenizer.pad_token_id, tokenizer.pad_token)  # pad: 151643 <|endoftext|>
print("in_tokens:")
for x in in_tokens:
    print(x)
# tensor([ 3430, 23649,   374,   279,  6427,   938,   315,   220])
# tensor([ 26567, 151643, 151643, 151643, 151643, 151643, 151643, 151643])

outputs = model.generate(in_tokens.to("cuda"))
for x in outputs:
    print(x)
# tensor([ 3430, 23649,   374,   279,  6427,   938,   315,   220, 30743,
#            16,  2130,   323,   220,    17,    15,   339, 23631,    13,
#           576,  3283], device='cuda:0')
# tensor([ 26567, 151643, 151643, 151643, 151643, 151643, 151643, 151643,      2,
#             16,     17,     18,    271,  22599,  14363,   1965,   7591,     62,
#             21,     19], device='cuda:0')

output_text = tokenizer.batch_decode(outputs)
for x in output_text:
    print(x)
# Beijing is the captical of __1__ Beijing. The city has a population of over
# Apple<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>## 50. If (a) is a positive
```
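One common workaround (a sketch continuing the snippet above, not re-run here): switch the tokenizer to left padding and pass the attention mask to generate(), so the model is never asked to continue from a padding token.

```python
# continuation of the Qwen snippet above: model, tokenizer and in_text are already defined
tokenizer.padding_side = "left"
enc = tokenizer(in_text, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```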
  2. DeepSeek's model appears to use the Llama architecture; it is unclear whether the model structure was changed at all.
```python
import torch
from transformers import AutoModelForCausalLM

model_pth = "deepseek-ai/deepseek-coder-1.3b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_pth, torch_dtype=torch.bfloat16, device_map="cuda")
print(type(model))  # <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
```